Abstract

Spectral clustering is a novel clustering method which can detect
complex shapes of data clusters. However, it requires the eigen
decomposition of the graph Laplacian matrix, which takes $O(n^3)$ time
and thus is not suitable for large scale systems. Recently, many methods
have been proposed to reduce the computational time of spectral
clustering. These approximate methods usually involve sampling
techniques, by which much of the information in the original data may be
lost. In this work, we propose a fast and accurate spectral clustering
approach using an approximate commute time embedding, which is similar
to the spectral embedding. The method requires neither sampling nor the
computation of any eigenvector. Instead, it uses random projection and a
linear time solver to find the approximate embedding. Experiments on
several synthetic and real datasets show that the proposed approach has
better clustering quality and is faster than the state-of-the-art
approximate spectral clustering methods.

Keywords: spectral clustering, commute time embedding, random
projection, linear time solver

1 Introduction 

Data clustering is an important problem and has been studied extensively
in data mining research [IT]. Traditional methods such as k-means or
hierarchical techniques usually assume that the data has clusters of
convex shapes, so that they can be linearly separated using Euclidean
distance. On the other hand, spectral clustering can detect clusters of
more complex geometry and has been shown to be more effective than
traditional techniques in different application domains [20], [24], [17].
The intuition of spectral clustering is that it maps the data from the
original feature space to the eigenspace of the Laplacian matrix, where
the clusters can be linearly separated and are thus easier to detect
using traditional techniques like k-means. This technique requires the
eigen decomposition of the graph Laplacian, which takes $O(n^3)$ time
and is not applicable to large graphs.

Recent studies try to solve this problem by accelerating the eigen
decomposition step. They involve either sampling or low-rank matrix
approximation techniques [7], [30], [31], [4]. [7] used the traditional
Nyström method to solve the eigensystem on data representatives which
were sampled randomly, and then extrapolated the solution to the whole
dataset. [31] performed spectral clustering on a small set of data
centers chosen by k-means or a random projection tree. Then each data
point was assigned to the cluster of its center from the center
selection step. A recent work in [4] used the idea of sparse coding to
approximate the affinity matrix based on a number of data
representatives, so that the eigensystem can be computed very
efficiently. However, all of them involve sampling techniques. Whether
the samples or representatives are chosen uniformly at random or by a
more expensive selection, they may not completely represent the whole
dataset and may not correctly capture the geometric structure of the
clusters. Moreover, all of them involve computing the eigenvectors of
the Laplacian and cannot be used directly on graph data, which are
widely available in forms such as social networks, web graphs, and
collaborative filtering graphs.

In this paper, we propose a different approach using an approximate
commute time embedding. Commute time is a random walk based metric on
graphs. The commute time between two nodes i and j is the expected
number of steps a random walk starting at i will take to reach j for the
first time and then return to i. The fact that commute time is averaged
over all paths (and not just the shortest path) makes it more robust to
data perturbations. Commute time has found widespread applications in
personalized search [25], collaborative filtering [HH], anomaly
detection [13], link prediction in social networks [16], and making
search engines robust against manipulation [10]. Commute time can be
embedded in an eigenspace of the graph Laplacian matrix where the
squared pairwise Euclidean distances are the commute times in the
similarity graph [6]. Therefore, clustering using the commute time
embedding follows a similar idea to spectral clustering, and the two
have quite similar clustering capabilities.

Another kind of study in [19] proposed a semi-supervised framework using
data labels to improve the efficiency of the power method in finding
eigenvectors for spectral clustering. Alternatively, [3] used parallel
processing to accelerate spectral clustering in a distributed
environment. In our work, we focus only on the acceleration of spectral
clustering on a single machine in an unsupervised manner.

The contributions of this paper are as follows: 

• We show the similarity in idea and implementation 
between spectral clustering and clustering using 
commute time embedding. The experiments show 
that they have quite similar clustering capabilities. 

• We show the weakness of sampling-based approximate approaches and
propose a fast and accurate spectral clustering method using an
approximate commute time embedding. It does not sample the data, does
not compute any eigenvector, and can work directly on graph data.
Moreover, the approximate embedding can be applied to other
applications that use the commute time.

• We show the effectiveness of the proposed methods 
in terms of accuracy and performance in several 
synthetic and real datasets. It is more accurate 
and faster than the state-of-the-art approximate 
spectral clustering methods. 

The remainder of the paper is organized as follows. Sections 2 and 3
describe the spectral clustering technique and efforts to approximate it
to reduce the computational time. Section 4 reviews notations and
concepts related to commute time and its embedding, and the relationship
between spectral clustering and clustering using the commute time
embedding. In Section 5, we present a method to approximate spectral
clustering with an approximate commute time embedding. In Section 6, we
evaluate our approach using experiments on several synthetic and real
datasets. Section 7 covers the discussion of related issues. We conclude
in Section 8 with a summary and a direction for future research.

2 Spectral Clustering 

Given a dataset $X \in \mathbb{R}^{d}$ with n data points
$x_1, x_2, \ldots, x_n$ and d dimensions, we define an undirected and
weighted graph G. Let $A = (w_{ij})$, $1 \le i, j \le n$, be the
affinity matrix of G.

Let i be a node in G and N(i) be its neighbors. The degree $d_i$ of a
node i is $\sum_{j \in N(i)} w_{ij}$. The volume $V_G$ of the graph is
defined as $\sum_{i=1}^{n} d_i$. Let D be the diagonal degree matrix
with diagonal entries $d_i$. The Laplacian of G is the matrix
$L = D - A$.

Spectral clustering assigns each data point in X to one of k clusters.
The details are in Algorithm 1.



Algorithm 1 Spectral Clustering

Input: Data matrix $X \in \mathbb{R}^{d}$, number of clusters k
Output: Cluster membership for each data point

1: Construct a similarity graph G from X and compute its Laplacian
   matrix L.
2: Compute the first k eigenvectors of L.
3: Let $U \in \mathbb{R}^{k}$ be the eigenspace containing these k
   vectors as columns, where each row of U corresponds to a data point
   in X.
4: Cluster the points in U using k-means clustering.



There are three typical similarity graphs: the $\epsilon$-neighborhood
graph (connecting nodes whose distances are shorter than $\epsilon$),
the fully connected graph (connecting all nodes with each other), and
the k-nearest neighbor graph (connecting nodes u and v if u belongs to
the k nearest neighbors of v or v belongs to the k nearest neighbors of
u) [17]. The $\epsilon$-neighborhood graph and the k-nearest neighbor
graph ($k \ll n$) are usually sparse, which is an advantage in
computation. The typical similarity function is the Gaussian kernel
function $w_{ij} = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}$, where $\sigma$
is the kernel bandwidth.
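As an illustration of the graph construction, the following is a minimal sketch (not the Matlab code used in the experiments) that builds a k-nearest neighbor affinity matrix with Gaussian weights and the Laplacian $L = D - A$. The function names, the brute-force neighbor search, and the toy points are assumptions made for the example; the factor 2 in the kernel exponent follows the common convention.

```python
import math

def knn_gaussian_graph(points, k, sigma):
    """Symmetric k-nearest-neighbor affinity matrix with Gaussian weights."""
    n = len(points)
    # Squared Euclidean distances between all pairs (brute force, O(n^2)).
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
          for p in points]
    A = [[0.0] * n for _ in range(n)]
    for u in range(n):
        # Indices of the k nearest neighbors of u, excluding u itself.
        nearest = sorted((v for v in range(n) if v != u),
                         key=lambda v: d2[u][v])[:k]
        for v in nearest:
            w = math.exp(-d2[u][v] / (2.0 * sigma ** 2))
            # Connect u and v if either one is among the other's neighbors.
            A[u][v] = w
            A[v][u] = w
    return A

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

# Two well-separated groups of three points each (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
A = knn_gaussian_graph(points, k=2, sigma=1.0)
L = laplacian(A)
```

With k = 2 each point is linked only to the two other points of its own group, so the graph has no edge between the groups and every row of L sums to zero.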

Algorithm 1 shows that spectral clustering transforms the data from its
original space to the eigenspace of the Laplacian matrix and uses
k-means to cluster data in that space. The representation in the new
space enhances the cluster properties in the data so that the clusters
can be linearly separated [17]. Therefore, a traditional technique like
k-means can easily cluster data in the new space.

We can use the normalized Laplacian matrix and its corresponding
eigenvectors as the eigenspace. Shi and Malik [24] computed the first k
eigenvectors of the generalized eigensystem as the eigenspace. These
eigenvectors are in fact the eigenvectors of the normalized Laplacian
$L_n = D^{-1}L$ [17]. Ng, Jordan, and Weiss use k eigenvectors of the
normalized Laplacian $L_n = D^{-1/2} L D^{-1/2}$ as the eigenspace. It
then requires the normalization of each row in the new space to norm 1
[20].

3 Related Work 

The spectral clustering method described in the previous section
involves the eigen decomposition of the (normalized) Laplacian matrix.
It takes $O(n^3)$ time and is not feasible for large graphs. Even if we
reduce the cost by using a sparse similarity graph and a sparse,
iterative eigen decomposition algorithm, it is still expensive for large
graphs.

Most approaches try to approximate spectral clustering using sampling or
low-rank approximation techniques. [7] used the Nyström technique and
[30] used column sampling to solve the eigensystem on a smaller sample
and extrapolated the solution to the whole dataset.

[31] provided a framework for fast approximate spectral clustering. A
number of centers were chosen from the data by using k-means or a random
projection tree. Then these centers were clustered by spectral
clustering. Each data point was assigned the cluster membership that
spectral clustering gave to its center. However, the center selection
step is time consuming for large datasets.

[4] used the idea of sparse coding to design an approximate affinity
matrix $A = ZZ^T$ ($Z \in \mathbb{R}^{n \times s}$, where s is the
number of representatives, or landmarks in their terminology) so that
the eigen decomposition of an $(n \times n)$ matrix A can be found from
the eigen decomposition of a smaller $(s \times s)$ matrix $Z^T Z$.
Since the smallest eigenvectors of $L_n = D^{-1/2} L D^{-1/2}$ are the
largest eigenvectors of $D^{-1/2} A D^{-1/2}$, we have the eigen
solution of $L_n$. The s landmarks can be selected by random sampling or
by the k-means method. They claimed that choosing the landmarks by
random sampling is a balance between accuracy and performance.

However, all these methods involve data sampling, either by random
selection or by a k-means selection. Using k-means or other methods to
select the representative centers is costly in large datasets since the
number of representatives cannot be too small. Moreover, any kind of
sampling will suffer from the loss of information in the original data,
since the representatives may not completely represent the whole dataset
and may not correctly capture the geometric structure of the clusters.
Therefore, any approximation based on these representatives also suffers
from this information loss. These facts will be illustrated in the
experiments. Moreover, these approximations cannot be used directly for
graph data.

4 Commute Time Embedding and Spectral 
Clustering 

This section reviews the concept of the commute time, 
its embedding and the relationship between clustering 
in the commute time embedding and spectral clustering. 

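Definition 1. The Hitting Time $h_{ij}$ is the expected number of steps
that a random walk starting at i will take before reaching j for the
first time.

Definition 2. The Hitting Time can be defined in terms of the recursion

$$h_{ij} = \begin{cases} 1 + \sum_{l \in N(i)} p_{il} h_{lj} & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

where

$$p_{ij} = \begin{cases} w_{ij}/d_i & \text{if } (i,j) \text{ belong to an edge} \\ 0 & \text{otherwise} \end{cases}$$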

Definition 3. The Commute Time $c_{ij}$ between two nodes i and j is
given by $c_{ij} = h_{ij} + h_{ji}$.
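To make Definitions 1 to 3 concrete, the sketch below evaluates the hitting time recursion by simple fixed-point iteration on a small connected graph and sums the two hitting times to obtain the commute time. This is only an illustration under stated assumptions: the function names, the tolerance, and the toy path graph are not from the paper, and the iteration is far too slow for large graphs.

```python
def hitting_times_to(A, j, tol=1e-12, max_iter=100000):
    """Hitting times h[i] = h_ij from every node i to a fixed target j.

    Iterates h_ij = 1 + sum_l p_il * h_lj with p_il = w_il / d_i and
    h_jj = 0 until the values stop changing (connected graph assumed).
    """
    n = len(A)
    deg = [sum(row) for row in A]
    h = [0.0] * n
    for _ in range(max_iter):
        new_h = [0.0 if i == j else
                 1.0 + sum(A[i][l] / deg[i] * h[l] for l in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    return h

def commute_time(A, i, j):
    """c_ij = h_ij + h_ji."""
    return hitting_times_to(A, j)[i] + hitting_times_to(A, i)[j]

# Path graph 0 - 1 - 2 with unit edge weights.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
c02 = commute_time(A, 0, 2)  # close to 8: h_02 = h_20 = 4
```

On this path graph the recursion can also be solved by hand: $h_{12} = 3$ and $h_{02} = 4$, so $c_{02} = 8$ by symmetry.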

Fact 4.1. Commute time is a metric: (i) $c_{ii} = 0$, (ii)
$c_{ij} = c_{ji}$, and (iii) $c_{ij} \le c_{ik} + c_{kj}$ fL/fl.

Fact 4.2.

1. Let $e_i$ be the n-dimensional column vector with a 1 at location i
   and zero elsewhere.

2. Let $(\lambda_i, v_i)$ be the eigenpair of L for all nodes i, i.e.,
   $L v_i = \lambda_i v_i$.

3. It is well known that $\lambda_1 = 0$, $v_1 = (1, 1, \ldots, 1)^T$
   and all $\lambda_i \ge 0$.

4. Assume $0 = \lambda_1 < \lambda_2 \le \ldots \le \lambda_n$.

5. The eigen decomposition of the Laplacian is $L = V S V^T$ where
   $S = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and
   $V = (v_1, v_2, \ldots, v_n)$.

6. Then the pseudo-inverse of L, denoted by $L^+$, is

$$L^+ = \sum_{i=2}^{n} \frac{1}{\lambda_i} v_i v_i^T$$

Remarkably, the commute time can be expressed in terms of the Laplacian
of G [6], [5].

Fact 4.3.

$$c_{ij} = V_G \left( l^+_{ii} + l^+_{jj} - 2 l^+_{ij} \right) = V_G (e_i - e_j)^T L^+ (e_i - e_j) \qquad (4.1)$$

where $l^+_{ij}$ is the (i,j) element of $L^+$ Jfjj/.
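Equation (4.1) can be checked numerically without forming $L^+$ explicitly. The sketch below assumes a connected graph, solves $Lx = e_i - e_j$ after fixing the last coordinate of x to zero (any solution differs from $L^+(e_i - e_j)$ only by a constant vector, which cancels in $(e_i - e_j)^T x$), and returns $V_G (x_i - x_j)$. The dense Gaussian elimination and all function names are illustrative choices, not part of the proposed method.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, M[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def commute_time_laplacian(A, i, j):
    """c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j) for a connected graph."""
    n = len(A)
    deg = [sum(row) for row in A]
    volume = sum(deg)  # V_G
    L = [[(deg[r] if r == c else 0.0) - A[r][c] for c in range(n)]
         for r in range(n)]
    rhs = [0.0] * n
    rhs[i] += 1.0
    rhs[j] -= 1.0
    # L is singular (constant null space): fix x[n-1] = 0 and solve the
    # remaining (n-1) x (n-1) system, which is nonsingular when G is connected.
    keep = range(n - 1)
    x = solve([[L[r][c] for c in keep] for r in keep],
              [rhs[r] for r in keep]) + [0.0]
    return volume * (x[i] - x[j])

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
c_path = commute_time_laplacian(path, 0, 2)          # V_G = 4, resistance 2
c_triangle = commute_time_laplacian(triangle, 0, 1)  # V_G = 6, resistance 2/3
```

Both values agree with the definition through hitting times: 8 for the end points of the path and 4 for two nodes of the triangle.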

Theorem 4.1. $\theta = \sqrt{V_G}\, V S^{-1/2} \in \mathbb{R}^{n}$ is a
commute time embedding where the square root of the commute time
$\sqrt{c_{ij}}$ is a Euclidean distance between i and j in $\theta$.

Proof. Equation (4.1) can be written as:

$$\begin{aligned}
c_{ij} &= V_G (e_i - e_j)^T L^+ (e_i - e_j) \\
&= V_G (e_i - e_j)^T V S^{-1} V^T (e_i - e_j) \\
&= V_G (e_i - e_j)^T V S^{-1/2} S^{-1/2} V^T (e_i - e_j) \\
&= \left[\sqrt{V_G}\, S^{-1/2} V^T (e_i - e_j)\right]^T \left[\sqrt{V_G}\, S^{-1/2} V^T (e_i - e_j)\right].
\end{aligned}$$

Thus the commute time is the squared pairwise Euclidean distance between
column vectors in the space $\sqrt{V_G}\, S^{-1/2} V^T$ or row vectors
in the space $\theta = \sqrt{V_G}\, V S^{-1/2}$.



Since the commute time can capture the geometric structure of the data,
using k-means in the embedding $\theta$ can effectively capture complex
clusters. This is very similar to the idea of spectral clustering. The
commute time is a metric capturing the geometric structure of the data
and is embedded in the Laplacian eigenspace. Spectral clustering,
likewise, maps the original data to the eigenspace of the Laplacian
where the clusters can be linearly separated. However, there are some
differences between Commute time Embedding Spectral Clustering (denoted
as CESC) and spectral clustering.

• Spectral clustering only uses k eigenvectors of V 
while CESC uses all the eigenvectors. 

• The eigenspace in CESC is scaled by the eigenval- 
ues of the Laplacian. 

In the case of the normalized Laplacian $L_n = D^{-1/2} L D^{-1/2}$, the
embedding for the commute time is
$\theta_n = \sqrt{V_G}\, D^{-1/2} V_n S_n^{-1/2}$ [21], where $V_n$ and
$S_n$ are the matrices containing the eigenvectors and eigenvalues of
$L_n$. The normalized eigenspace is additionally scaled by the degree
matrix. However, since the commute time is a metric independent of the
Laplacian, and k-means in the eigenspace uses the squared Euclidean
distance, which is the commute time in the graph space, CESC is
independent of the choice between the normalized and the unnormalized
Laplacian.

5 Approximate Commute Time Embedding 
Clustering 

The embedding $\theta = \sqrt{V_G}\, V S^{-1/2}$ is costly to create
since it takes $O(n^3)$ time for the eigen decomposition of L. Even if
we make use of the sparsity of L in a sparse graph by computing a few of
the smallest eigenvectors of L [22] using the Lanczos method [S], the
method is still hard to converge and thus is inefficient for large
graphs. We adopt the idea in [25] to approximate the commute time
embedding more efficiently. Spielman and Srivastava [25] used random
projection and the linear time solver of Spielman and Teng [26], [27] to
build a structure where we can compute the commute time between two
nodes in $O(\log n)$ time.

Fact 5.1. Let m be the number of edges in G. If the edges in G are
oriented, $B_{m \times n}$ given by:

$$B(u, v) = \begin{cases} 1 & \text{if } v \text{ is } u\text{'s head} \\ -1 & \text{if } v \text{ is } u\text{'s tail} \\ 0 & \text{otherwise} \end{cases}$$

is a signed edge-vertex incidence matrix and $W_{m \times m}$ is a
diagonal matrix whose entries are the edge weights. Then $L = B^T W B$.
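The factorization in Fact 5.1 is easy to verify on a small weighted graph. In the sketch below, each edge is given as a (tail, head, weight) triple with an arbitrary orientation; the helper names and the three-edge example are assumptions made for illustration.

```python
def incidence_and_weights(edges, n):
    """Signed edge-vertex incidence matrix B (m x n) and the diagonal of W."""
    B = [[0.0] * n for _ in edges]
    W = [w for _, _, w in edges]
    for e, (tail, head, _) in enumerate(edges):
        B[e][head] = 1.0    # v is the head of edge e
        B[e][tail] = -1.0   # v is the tail of edge e
    return B, W

def bt_w_b(B, W, n):
    """Dense product B^T W B, with W given by its diagonal."""
    m = len(W)
    return [[sum(B[e][r] * W[e] * B[e][c] for e in range(m))
             for c in range(n)] for r in range(n)]

# Weighted triangle: edges (0,1), (1,2), (0,2) with weights 2, 0.5, 1.
edges = [(0, 1, 2.0), (1, 2, 0.5), (0, 2, 1.0)]
B, W = incidence_and_weights(edges, 3)
L = bt_w_b(B, W, 3)
# L equals D - A: degrees 3, 2.5, 1.5 on the diagonal and the negated
# edge weights off the diagonal.
```

Reversing the orientation of any edge flips the sign of one row of B and leaves $B^T W B$ unchanged, which is why the orientation can be chosen arbitrarily.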



Lemma 5.1. ([25]) $\theta = \sqrt{V_G}\, L^+ B^T W^{1/2} \in \mathbb{R}^{m}$
is a commute time embedding where the square root of the commute time
$\sqrt{c_{ij}}$ is a Euclidean distance between i and j in $\theta$.

Proof. From Equation (4.1):

$$\begin{aligned}
c_{ij} &= V_G (e_i - e_j)^T L^+ (e_i - e_j) \\
&= V_G (e_i - e_j)^T L^+ L L^+ (e_i - e_j) \\
&= V_G (e_i - e_j)^T L^+ B^T W B L^+ (e_i - e_j) \\
&= V_G \left[(e_i - e_j)^T L^+ B^T W^{1/2}\right] \left[W^{1/2} B L^+ (e_i - e_j)\right] \\
&= \left[\sqrt{V_G}\, W^{1/2} B L^+ (e_i - e_j)\right]^T \left[\sqrt{V_G}\, W^{1/2} B L^+ (e_i - e_j)\right].
\end{aligned}$$

Thus the commute time is the squared pairwise Euclidean distance between
column vectors in the space $\sqrt{V_G}\, W^{1/2} B L^+$ or between row
vectors in the space
$\theta = \sqrt{V_G}\, L^+ B^T W^{1/2} \in \mathbb{R}^{m}$.

These distances are preserved under the Johnson-Lindenstrauss Lemma if
we project a row vector in $\theta$ onto a subspace spanned by
$k_{RP} = O(\log n)$ random vectors [12]. We can use a random matrix
$Q_{k_{RP} \times m}$ where $Q(i,j) = \pm 1/\sqrt{k_{RP}}$ with equal
probabilities, according to the following lemma.

Lemma 5.2. (fTj). Given fixed vectors
$v_1, \ldots, v_n \in \mathbb{R}^{d}$ and $\epsilon > 0$, let
$Q_{k_{RP} \times d}$ be a random matrix so that
$Q(i,j) = \pm 1/\sqrt{k_{RP}}$ with
$k_{RP} = O(\log n / \epsilon^2)$. With probability at least $1 - 1/n$:

$$(1 - \epsilon)\|v_i - v_j\|^2 \le \|Qv_i - Qv_j\|^2 \le (1 + \epsilon)\|v_i - v_j\|^2$$

for all pairs $i, j \in G$.
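The effect described by Lemma 5.2 can be observed with a few lines of code. The sketch below projects two random vectors with a $\pm 1/\sqrt{k_{RP}}$ sign matrix and compares the squared distance before and after projection. The dimensions, the seed, and the function names are arbitrary choices for the example, and a single pair of vectors is of course only an illustration, not a verification of the lemma.

```python
import math
import random

def random_sign_matrix(k, d, rng):
    """k x d matrix with entries +1/sqrt(k) or -1/sqrt(k), equally likely."""
    scale = 1.0 / math.sqrt(k)
    return [[scale if rng.random() < 0.5 else -scale for _ in range(d)]
            for _ in range(k)]

def project(Q, v):
    """Matrix-vector product Q v."""
    return [sum(q * x for q, x in zip(row, v)) for row in Q]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
d, k = 1000, 400
u = [rng.gauss(0.0, 1.0) for _ in range(d)]
v = [rng.gauss(0.0, 1.0) for _ in range(d)]
Q = random_sign_matrix(k, d, rng)

# Ratio of projected to original squared distance; it concentrates
# around 1 as k grows.
ratio = sq_dist(project(Q, u), project(Q, v)) / sq_dist(u, v)
```

The scaling by $1/\sqrt{k_{RP}}$ makes the expected squared length of a projected vector equal to its original squared length, so no further normalization is needed.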

Theorem 5.1. ([25]). Given $\epsilon > 0$ and a matrix
$Z_{O(\log n / \epsilon^2) \times n} = \sqrt{V_G}\, Q W^{1/2} B L^+$,
with probability at least $1 - 1/n$:

$$(1 - \epsilon) c_{ij} \le \|Z(e_i - e_j)\|^2 \le (1 + \epsilon) c_{ij}$$

for all pairs $i, j \in G$.

Proof. The proof comes directly from Lemmas 5.1 and 5.2.

Therefore we are able to construct a matrix
$Z = \sqrt{V_G}\, Q W^{1/2} B L^+$ for which
$c_{ij} \approx \|Z(e_i - e_j)\|^2$ with an error $\epsilon$. Since
computing $L^+$ directly is expensive, the linear time solver of
Spielman and Teng [26], [27] is used instead. First,
$Y = \sqrt{V_G}\, Q W^{1/2} B$ is computed. Then each of the
$k_{RP} = O(\log n)$ rows of Z (denoted as $z_i$) is computed by solving
the system $z_i L = y_i$, where $y_i$ is a row of Y. The linear time
solver of Spielman and Teng takes only $\tilde{O}(m)$ time to solve the
system [25].



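Since $\|z_i - \tilde{z}_i\|_L \le \epsilon \|z_i\|_L$, where
$\tilde{z}_i$ is the solution of $z_i L = y_i$ using the linear time
solver [25], we have:

$$(1 - \epsilon)^2 c_{ij} \le \|\tilde{Z}(e_i - e_j)\|^2 \le (1 + \epsilon)^2 c_{ij} \qquad (5.2)$$

where $\tilde{Z}$ is the matrix containing the row vectors
$\tilde{z}_i$.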

Equation (5.2) shows that the approximate spectral clustering using an
approximate commute time embedding, obtained by combining random
projection and a linear time solver, has the error $\epsilon^2$. The
method is detailed in Algorithm 2.

Algorithm 2 Commute time Embedding Spectral Clustering (CESC)

Input: Data matrix $X \in \mathbb{R}^{d}$, number of clusters k, number
of random vectors $k_{RP}$
Output: Cluster membership for each data point

1: Construct a $k_1$-nearest neighbor graph G from X with Gaussian
   kernel similarity ($k_1 \ll n$).
2: Compute matrices B, W, and L from G.
3: Compute $Y = \sqrt{V_G}\, Q W^{1/2} B$ where Q is a
   $\pm 1/\sqrt{k_{RP}}$ random matrix.
4: Compute all rows $z_i$ of $Z_{k_{RP} \times n} = Y L^+$ by $k_{RP}$
   calls to the Spielman-Teng solver.
5: Cluster the points in $Z^T$ using k-means clustering.
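Steps 2 to 4 of Algorithm 2 can be sketched on a toy graph as follows. This is an illustration under explicit assumptions, not the implementation used in the experiments: a plain conjugate gradient iteration stands in for the Spielman-Teng solver, the graph is given directly as (tail, head, weight) edges, and $k_{RP}$ is set far larger than n only so that the check on this six-node graph is tight. All function names are hypothetical.

```python
import math
import random

def cg_solve(L, b, tol=1e-10):
    """Conjugate gradient for L x = b with b orthogonal to the ones vector.

    Stand-in for a nearly linear time Laplacian solver; starting from
    x = 0 it returns the minimum-norm solution L^+ b.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(10 * n):
        if math.sqrt(rs) < tol:
            break
        Lp = [sum(L[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Lp[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Lp[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

def cesc_embedding(edges, n, k_rp, rng):
    """Rows z_i of Z = Y L^+ with Y = sqrt(V_G) Q W^{1/2} B."""
    L = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    volume = sum(L[i][i] for i in range(n))  # V_G = sum of degrees
    scale = math.sqrt(volume / k_rp)         # sqrt(V_G) * 1/sqrt(k_RP)
    Z = []
    for _ in range(k_rp):
        y = [0.0] * n                        # one row of Y
        for u, v, w in edges:
            q = scale if rng.random() < 0.5 else -scale
            y[v] += q * math.sqrt(w)         # head entry of B is +1
            y[u] -= q * math.sqrt(w)         # tail entry of B is -1
        Z.append(cg_solve(L, y))             # z_i L = y_i (L is symmetric)
    return Z

# Two unit-weight triangles joined by the bridge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0)]
Z = cesc_embedding(edges, n=6, k_rp=1000, rng=random.Random(0))
approx = sum((row[0] - row[5]) ** 2 for row in Z)  # ||Z(e_0 - e_5)||^2
exact = 14.0 * (2.0 / 3.0 + 1.0 + 2.0 / 3.0)       # V_G * effective resistance
```

The columns of Z (rows of $Z^T$) are the embedded points that step 5 would pass to k-means. Note that Q is never stored: each row of Y is accumulated edge by edge, which reflects the $O(m k_{RP})$ cost of forming Y.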



In Algorithm 2, $\theta = Z^T \in \mathbb{R}^{k_{RP} = O(\log n)}$ is
the embedding space where the squared pairwise Euclidean distance is the
approximate commute time. Applying k-means in $\theta$ is a novel way to
accelerate spectral clustering without using any sampling technique or
computing any eigenvector. Moreover, the approximate embedding is
guaranteed with the error bound $\epsilon^2$, and the method can be
applied directly to graph data.

5.1 Analysis Here we analyze the computational complexity of the
proposed method. First, the $k_1$-nearest neighbor graph is constructed
in $O(n \log n)$ time using a kd-tree. $Y = \sqrt{V_G}\, Q W^{1/2} B$ is
computed in $O(2mk_{RP} + m) = O(mk_{RP})$ time since there are only 2m
nonzeros in B and W is a diagonal matrix with m nonzeros. Then each of
the $k_{RP}$ rows of Z (denoted as $z_i$) is computed by solving the
system $z_i L = y_i$ in $\tilde{O}(m)$ time, where $y_i$ is a row of Y.
Since we use a $k_1$-nearest neighbor graph where $k_1 \ll n$,
$\tilde{O}(m) = \tilde{O}(n)$. Therefore, the construction of Z takes
$\tilde{O}(nk_{RP})$ time. The k-means algorithm in $Z^T$ takes
$O(tkk_{RP}n)$ time, where k is the number of clusters and t is the
number of iterations the algorithm needs to converge.

The summary of the analysis of CESC and other approximate spectral
clustering techniques is in Table 1. All methods create an embedded
space where they use k-means to cluster the data. The dimension of the
embedding of Nyström, KASP, and LSC is k, the number of clusters. For
CESC, it is $k_{RP}$.

Note that in practice, $k_{RP} = O(\log n / \epsilon^2)$ is small and
does not differ much between datasets. We will discuss it in the
experimental section. We can choose $k_{RP} \ll n$. Moreover, the
performance of the linear time solver is observed to be linear
empirically instead of $\tilde{O}(m)$ [15]. Therefore, the construction
of Z takes only $O(nk_{RP})$ time in practice.

In contrast, the number of representatives s cannot be very small if it
is to represent the whole dataset correctly. Therefore, the term
$O(s^3)$ cannot be ignored. The experiments show that CESC is faster
than all other methods while still maintaining better quality in the
clustering results.

6 Experimental Results 

6.1 Evaluation criteria We report on the experiments carried out to
determine and compare the effectiveness of the Nyström, KASP, LSC, and
CESC methods. The comparison included the clustering accuracy
(percentage) and the computational time (seconds). Accuracy was measured
against spectral clustering as the benchmark method, since all of the
methods are its approximations. The accuracy was computed as the
fraction of matching cluster memberships between spectral clustering and
the approximate method, given by:

$$\text{Accuracy} = \frac{\sum_{i=1}^{n} \delta[\mathrm{map}(c_i) = \mathrm{label}(i)]}{n},$$

where n is the number of data instances, and label(i) and $c_i$ are the
actual cluster label and the predicted label of a data instance i,
respectively. $\delta(\cdot)$ is an indicator function and
$\mathrm{map}(c_i)$ is a permutation function that maps cluster $c_i$ to
a category label. The best matching can be found using the Hungarian
algorithm [3].
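The accuracy measure can be sketched as follows. For brevity, this illustration searches all cluster-to-label mappings by brute force instead of using the Hungarian algorithm; both give the same optimum, but the brute-force search is practical only for a handful of clusters. The function name and the toy labelings are assumptions made for the example.

```python
from itertools import permutations

def clustering_accuracy(predicted, reference):
    """Fraction of matching memberships under the best cluster-to-label map."""
    clusters = sorted(set(predicted))
    labels = sorted(set(reference))
    if len(clusters) > len(labels):
        raise ValueError("more predicted clusters than reference labels")
    best = 0
    # Try every one-to-one assignment of predicted clusters to labels.
    # The Hungarian algorithm finds the same optimum in polynomial time.
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        matches = sum(1 for c, l in zip(predicted, reference)
                      if mapping[c] == l)
        best = max(best, matches)
    return best / len(reference)

# Cluster ids are arbitrary; only the grouping matters.
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], ["b", "b", "a", "a", "c", "a"])
# Best map is 0 -> "b", 1 -> "a", 2 -> "c": 5 of 6 instances match.
```

Because the measure maximizes over mappings, two labelings that differ only by a renaming of the clusters obtain an accuracy of 1.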

6.2 Methods and Parameters All the experimental results reported in the
following sections are averages over 10 trials. We chose the Gaussian
kernel function as the similarity function for all the methods. The
bandwidth $\sigma$ was chosen based on the width of the neighborhood
information for each dataset. For Nyström, KASP, and LSC, the eigenspace
was created from the normalized Laplacian $L_n = D^{-1/2} L D^{-1/2}$
since the normalized one is reported to be better [T5]. Methods using
the nearest neighbor graph chose $k_1 = 10$ as the number of nearest
neighbors in building the similarity graph.

The following is the detailed information regarding the experiments for
each method:

k-means: all the approximate methods used k-means to cluster the data in
the embedding. The Matlab built-in function 'kmeans' was used. The
number of replications was 5 and the maximum number of iterations was
100. The 'cluster' option (i.e., cluster 10% of the dataset to choose
initial centroids) was used.

Table 1: Complexity comparison of all approximate spectral clustering
methods. n, d, s, $k_{RP}$, and k are the numbers of instances,
features, representatives, random projection columns, and clusters,
respectively.

Method    Sampling     Affinity matrix    Embedded space          k-means
Nyström   $O(1)$       $O(dsn)$           $O(s^3 + sn)$           $O(tk^2n)$
KASP      $O(tdsn)$    $O(ds^2)$          $O(s^3)$                $O(tk^2s)$
LSC       $O(1)$       $O(dsn)$           $O(s^3 + s^2n)$         $O(tk^2n)$
CESC      N/A          $O(dn \log n)$     $\tilde{O}(k_{RP}n)$    $O(tkk_{RP}n)$

Spectral clustering: we implemented in Matlab the algorithm in [20].
Since it is not possible to do the eigen decomposition of the Laplacian
matrix of a fully connected graph for large datasets, a $k_1$-nearest
neighbor graph was built and the sparse function 'eigs' was used to find
the eigenspace.

Nyström: we used the Matlab implementation of Chen et al. [3], which is
available online at http://alumni.cs.ucsb.edu/~wychen/sc.html.

KASP: we implemented in Matlab the algorithm in [31] and used k-means to
select the representative centers.

LSC: we used $k_1 = 10$ for building the sparse matrix Z. In [4], the
representatives can be chosen by random sampling or by k-means. Since
the random selection was preferred by the authors and had a better
balance between running time and accuracy, we only used this option in
the experiments.

CESC: the algorithm was implemented in Matlab. The number of random
vectors $k_{RP} = 50$ was chosen throughout the experiments. We used
Koutis's CMG solver [15] as the nearly linear time solver for creating
the embedding. It is designed for symmetric diagonally dominant matrices
and is available online at http://www.cs.cmu.edu/~jkoutis/cmg.html.

6.3 An example A synthetic dataset featured data clusters in the shapes
of the phrase 'Data Mining', as in Figure 1. It has 2,000 data points in
10 clusters. We applied CESC, Nyström, KASP, and LSC to this dataset.
The number of representatives was 500, which was 25% of the data. The
results are shown in Figure 1. In the figures of Nyström, KASP, and LSC,
the red dots are the representatives selected by the corresponding
methods.

The results show the weakness of sampling-based approximate methods and
the strength of CESC. Although the number of representatives was large
(25% of the data), they did not completely capture the geometric
structure of all clusters. As a result, some characters were split:
part of a character was considered closer to another character because
of the structure of the representatives. CESC, on the other hand,
clustered all data points correctly since it used all the data
information. The exact spectral clustering also clustered the dataset
correctly.

6.4 Real Datasets We tested all four methods on several real datasets of
various sizes obtained from the UCI machine learning repository [8]. The
details of all datasets are in Table 2. We normalized all datasets so
that all features had mean 0 and standard deviation 1.

The number of representatives can be neither too small, or it will not
truly represent the data, nor too large, or it will significantly slow
down the methods. For the first 3 small datasets (Segmentation,
Spambase, and Musk), we used 20% of the data as the representatives. The
medium-sized Pen Digits and Letter Rec used 10% of the data as the
representatives. For the large dataset Connect4, we chose only 5,000
representatives, which is less than 10% of the data in Connect4. Since
the computational time of the sampling-based methods becomes very
expensive for large datasets if we use a high number of representatives,
the percentage of representatives is lower for larger datasets.

Tables 3 and 4 show the clustering results in accuracy (percentage) and
running time (seconds) for all the datasets. Considering the accuracy,
CESC outperformed all the sampling-based approximate methods on most of
the datasets, although the number of representatives they used was high.
Considering the computational time, CESC was also the fastest method on
all datasets.

From the complexity analysis in Section 5.1 we can see that the
bottleneck of CESC is the total running time of the graph creation and
k-means steps. This is clearly shown in the results in Table 5, which
presents the running time of each step of CESC as a percentage for each
dataset. The running time of the embedding step was dominated by the
other two steps. The advantage is that there have been many studies on
fast k-means and graph creation, or on techniques to parallelize them,
which we can make use of [3].

(c) LSC (d) CESC

Figure 1: Approximate spectral clustering methods using Nyström, KASP,
LSC, and CESC. This shows the weakness of approximate methods based on
sampling. The red dots are the representatives in Nyström, KASP, and
LSC. CESC and exact spectral clustering can cluster the two datasets
correctly.

Table 2: UCI Datasets.

Dataset        Instances   Features   Classes   Description
Segmentation   2,100       19         7         Image segmentation
Spambase       4,601       57         2         Spam prediction
Musk           6,598       166        2         Molecule prediction
Pen Digits     10,992      16         10        Pen-based recognition of handwritten digits
Letter Rec     20,000      16         26        Letter recognition
Connect4       67,557      42         3         The game of connect-4

Table 3: Clustering accuracy (percentage). CESC outperformed other
methods in most of the datasets.

Dataset        KASP     Nyström   LSC      CESC
Segmentation   74.5%    58.3%     73.2%    78.9%
Spambase       60.8%    82.7%     97.6%    100%
Musk           81.3%    50.6%     63.2%    97.2%
Pen Digits     83.4%    74.8%     80.1%    77.5%
Letter Rec     52.8%    39.2%     58.5%    40.1%
Connect4       86.8%    35.3%     83.0%    97.4%

Table 4: Computational time (seconds). CESC was the fastest among all
the approximate methods.

Dataset        KASP      Nyström    LSC       CESC
Segmentation   2.26      6.25       8.87      2.06
Spambase       25.33     28.26      46.68     16.32
Musk           178.24    110.87     154.09    63.18
Pen Digits     65.33     104.01     119.04    12.46
Letter Rec     236.43    529.86     395.47    59.45
Connect4       3400.38   10997.14   3690.86   1839.59

Table 5: Time distribution for CESC. The bottleneck of the algorithm is
the total running time of the graph creation and k-means steps.

Dataset        Graph    Embedding   k-means
Segmentation   54.0%    31.1%       14.9%
Spambase       90.1%    9.1%        0.8%
Musk           92.6%    6.9%        0.5%
Pen Digits     51.1%    33.0%       15.8%
Letter Rec     36.5%    17.0%       46.5%
Connect4       97.1%    2.7%        0.2%

6.5 Parameter sensitivity As we have already mentioned, $k_{RP}$ is
small in practice and does not differ much between datasets. [28]
suggested that $k_{RP} = 2\ln n / 0.25^2$, which is just about 500 for a
dataset of ten million points. We conducted an experiment with different
values of $k_{RP}$ on each dataset. The results in Figure 2 show that
the parameter $k_{RP}$ can be quite small, since the accuracy curve is
flat once $k_{RP}$ reaches a certain value (other datasets show a
similar tendency). This indicates that our choice of $k_{RP} = 50$ was
suitable for the datasets in the experiments. Moreover, the experiments
in the previous sections show that graph creation is the most dominant
step and that CESC is significantly faster than all the other methods.
Therefore, $k_{RP}$ can be quite small and does not considerably affect
the running time of CESC. This is another advantage of CESC: it is not
sensitive to its parameter in terms of both accuracy and performance.
For sampling-based methods, selecting the number of representatives to
balance accuracy and speed is not trivial.
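The suggested bound is easy to evaluate for the dataset sizes used here. The helper below is a hypothetical convenience function, with the rounding up to an integer being our own choice; it simply computes $2\ln n / \epsilon^2$.

```python
import math

def suggested_k_rp(n, eps=0.25):
    """Number of random vectors from the bound k_RP = 2 ln(n) / eps^2."""
    return math.ceil(2.0 * math.log(n) / eps ** 2)

k_ten_million = suggested_k_rp(10_000_000)  # 516, i.e. "about 500"
k_connect4 = suggested_k_rp(67_557)         # size of the largest UCI dataset
```

The bound grows only logarithmically with n, and the value $k_{RP} = 50$ used in the experiments is well below it, which is consistent with the flat accuracy curves reported in Figure 2.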

6.6 Graph Datasets One more advantage of CESC over KASP, Nyström, and
LSC is that it can work directly on the similarity graph, while the
others cannot since they have a sampling step on the original feature
data. An experiment to show the scalability of the proposed method on
large graphs was conducted on the DBLP co-authorship network obtained
from http://dblp.uni-trier.de/xml/ and some real network graphs obtained
from the Stanford Large Network Dataset Collection, which is available
at http://snap.stanford.edu/data/. CA-AstroPh is a collaboration network
of Arxiv Astro Physics; Email-Enron is an email communication network
from the Enron company; and RoadNet-TX is a road network of Texas in the
US.

All the graphs were undirected. If a graph was not connected, its
largest connected component was extracted. We arbitrarily chose 50 as
the number of clusters for all the datasets. The results using CESC are
shown in Table 6.

In the case of graph data, the running time of k-means dominated the
whole method. CESC took less than 10 minutes to create an approximate
embedding for the network graph of more than 1.3 million nodes.

DBLP case study Since all the above graphs are too big for a qualitative
analysis, a subset of the main data mining conferences in the DBLP graph
was analyzed. We selected only authors and publications appearing in
KDD, PKDD, PAKDD, ICDM, and SDM. Each author also needed to have at
least 10 publications, and his or her co-authors needed to meet the same
minimum. This selected only authors who published frequently in major
data mining conferences and collaborated with similar co-authors. Then
the largest connected component of the graph was extracted. The final
graph has 397 nodes and 1,695 edges.

CESC was applied to the subgraph with k = 50 
clusters. Since researchers have collaborated and moved 
between research groups over time, some 
clusters are probably merges of groups caused by the 
collaborations and moves of prominent researchers. 
However, the method effectively captures clusters 
representing many well-known data mining research 
groups at CMU, IBM Research Centers (Watson and 
Almaden), University of California Riverside, LMU 
Munich, University of Pisa, University of Technology 
Sydney, Melbourne University, etc. 



Figure 2: k_RP can be quite small since the accuracy curve changes only slightly once k_RP reaches a certain value. Panels: (a) Spambase, (b) Musk, (c) Pen Digits. 

Table 6: Clustering time (in seconds) for some network graphs. CESC took less than 10 minutes to create an 
approximate embedding for the network graph of more than 1.3 million nodes. 

Dataset        Nodes       Edges       Embedding (s)   k-means (s)   Total time (s)
CA-AstroPh     17,903      197,001     24.36           50.62         74.98
Email-Enron    33,696      180,811     27.33           167.08        194.41
DBLP           612,949     2,345,178   764.04          4572.25       5336.31
RoadNet-TX     1,351,137   1,879,201   576.62          4691.53       5268.15



7 Discussion 

Von Luxburg, Radl, and Hein [29] 
showed that the commute time between two nodes on 
a random geometric graph converges to an expression 
that depends only on the degrees of these two nodes 
and does not take into account the structure of the 
graph. Therefore, they claimed that it is meaningless 
as a distance function on large graphs. However, their 
results do not invalidate our work, for the following 
reasons. 

• Their proof was based on random geometric graphs, 
which may not be the case in practice. A 
random geometric graph does not have the natural 
clusters that clustering algorithms try to detect. 
Moreover, their claim holds only under many 
assumptions on the graph. 

• Their experiments showed that the approximation 
becomes worse when the data has cluster structure. 
However, an unsupervised distance-based technique 
can work well only if the data 
has a cluster structure, so that separation 
based on distance is meaningful. We believe 
that many real datasets have cluster structure 
to a certain degree. 

• Our experiments show that CESC approximated 
spectral clustering well on several real datasets, and 
thus commute time is not meaningless there. This shows that 
the approximate commute time embedding method 
still has potential as a fast and accurate 
approximation of spectral clustering. 
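
To make the argument above concrete, the following sketch compares exact commute times with a degree-only expression on a small graph with two clear clusters. It assumes the standard identity c_ij = V_G (L+_ii + L+_jj - 2 L+_ij), where V_G is the graph volume and L+ is the pseudoinverse of the Laplacian, and it takes the limit expression of [29] to be V_G (1/d_i + 1/d_j). The toy graph, the function names, and the dense matrix inverse are illustrative choices that are suitable only for very small graphs.

```python
import numpy as np


def commute_times(W):
    """Exact commute times c_ij = V_G * (L+_ii + L+_jj - 2 L+_ij)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # For a connected graph, L+ = (L + 11^T / n)^{-1} - 11^T / n.
    L_pinv = np.linalg.inv(L + 1.0 / n) - 1.0 / n
    diag = np.diag(L_pinv)
    return d.sum() * (diag[:, None] + diag[None, :] - 2.0 * L_pinv)


def degree_only_expression(W):
    """Degree-only expression V_G * (1/d_i + 1/d_j) for i != j."""
    d = W.sum(axis=1)
    approx = d.sum() * (1.0 / d[:, None] + 1.0 / d[None, :])
    np.fill_diagonal(approx, 0.0)
    return approx


# Two 4-node cliques joined by one weak edge (hypothetical toy graph).
W = np.zeros((8, 8))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.1

C = commute_times(W)
C_deg = degree_only_expression(W)
within = C[0, 1]     # both nodes in the first clique
between = C[0, 7]    # nodes in different cliques
```

In this toy graph nodes 0, 1, and 7 all have degree 3, so the degree-only expression assigns the same value to the within-cluster pair (0, 1) and the between-cluster pair (0, 7), while the exact commute time is much larger for the pair separated by the weak edge.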

As already mentioned in the experiments on real 
feature data and graph data, the bottlenecks of CESC are 
the creation of the nearest neighbor graph and the k-means 
algorithm. The cost of creating the embedding is 
actually very small compared to the whole cost of the 
algorithm. Once we have the embedding, we can apply 
any fast partitional or hierarchical clustering technique 
to it. Methods proposed in [3] reduce the 
cost of creating the nearest neighbor graph and of k-means 
in both centralized and distributed settings. Therefore, 
we believe CESC can be accelerated considerably using these 
techniques. However, this is beyond the scope of this work. 
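
The embedding step whose cost is discussed above can be sketched as follows. The sketch assumes the construction Z = sqrt(V_G) Q W^{1/2} B L+, with a random projection Q whose entries are +1/sqrt(k_RP) or -1/sqrt(k_RP), the signed edge-node incidence matrix B, and the diagonal edge-weight matrix W. A sparse LU factorization of the grounded Laplacian (node 0 fixed to zero) stands in for the linear time solver, which it is not, and all function and parameter names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, triu
from scipy.sparse.linalg import splu


def commute_time_embedding(A, k_rp=50, seed=0, Q=None):
    """Sketch of an approximate commute time embedding (assumed form).

    A    : symmetric sparse adjacency matrix of a connected graph (n x n)
    k_rp : number of random projection dimensions
    Q    : optional (k x m) projection matrix; drawn at random if None
    Returns an (n x k) array whose squared row distances approximate
    the commute times c_ij.
    """
    A = csr_matrix(A, dtype=float)
    n = A.shape[0]
    edges = triu(A, k=1).tocoo()             # each undirected edge once
    m = edges.nnz
    idx = np.arange(m)
    ones = np.ones(m)
    # Signed edge-node incidence matrix B (m x n).
    B = (csr_matrix((ones, (idx, edges.row)), shape=(m, n))
         - csr_matrix((ones, (idx, edges.col)), shape=(m, n)))
    degrees = np.asarray(A.sum(axis=1)).ravel()
    L = (diags(degrees) - A).tocsc()          # graph Laplacian
    if Q is None:
        rng = np.random.default_rng(seed)
        Q = rng.choice([-1.0, 1.0], size=(k_rp, m)) / np.sqrt(k_rp)
    # Y^T = sqrt(V_G) * B^T W^{1/2} Q^T, shape (n x k).
    Yt = np.sqrt(degrees.sum()) * (B.T @ (Q * np.sqrt(edges.data)).T)
    # Solve L z = y for every column, with node 0 grounded at zero.
    # Grounding shifts all nodes by the same offset, so pairwise
    # distances between rows are unchanged.
    Zt = np.zeros_like(Yt)
    Zt[1:] = splu(L[1:, 1:].tocsc()).solve(Yt[1:])
    return Zt


# Toy usage: two 20-node cliques joined by a single edge.
W = np.zeros((40, 40))
W[:20, :20] = 1.0
W[20:, 20:] = 1.0
np.fill_diagonal(W, 0.0)
W[19, 20] = W[20, 19] = 1.0
Z = commute_time_embedding(csr_matrix(W), k_rp=50)
```

Each row of `Z` corresponds to one node, so any k-means implementation can be run on the rows directly. Passing an identity matrix as `Q` disables the random projection and reproduces the exact commute times, which is a convenient check on small graphs.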

8 Conclusion 

This paper shows that clustering using an approximate 
commute time embedding is a fast and accurate approximation 
of spectral clustering. The strength of the method 
is that it does not involve any sampling technique, which 
may not correctly represent the whole dataset. It also does 
not need to compute any eigenvector. Instead it uses 
random projection and a linear time solver, which 
guarantee its accuracy and performance. The experimental 
results on several synthetic and real datasets 
and graphs of various sizes show the effectiveness of 
the proposed approach in terms of performance and 
accuracy. It is faster than the state-of-the-art approximate 
spectral clustering techniques while achieving 
better clustering accuracy. The proposed method can 
also be applied directly to graph data. It takes less 
than 10 minutes to create the approximate embedding 
for a network graph of more than 1.3 million nodes. 
Moreover, once we have the embedding, the proposed 
method can be applied to any application that utilizes 
the commute time, such as image segmentation, anomaly 
detection, and collaborative filtering. 

In the future, techniques to avoid the bottlenecks of 
CESC, including the acceleration of graph creation 
and k-means, will be investigated. Moreover, although the 
analysis and experimental results show that CESC and 
spectral clustering have quite similar clustering ability, 
a deeper theoretical analysis needs to be done to examine 
the strengths and weaknesses of each method against the 
other. 